This coursework focuses on housing prices, with the main objective being to predict the price of a property based on various inputs. The inputs include features such as the area, the number and types of rooms, and additional factors like the availability of a main road, hot water heating, and more.
The dependent variable is the price, as it is the primary concern for most people searching for a house. The goal of this work is to predict the price based on diverse inputs, which consist of mixed data types, such as:
This project addresses a regression problem because the objective is to predict a numeric value—in this case, the price of the property.
Now we are going to import our dataset into this project.
dt_houses <- fread(file = "./datasets/Regression_set.csv")
I would like to check whether I have any missing values in my dataset. I think it is a good idea to go through all rows and columns and check for NAs, using the built-in R function complete.cases(data_table). For each row it returns FALSE if the row contains an NA value and TRUE otherwise.
nas <- dt_houses[!complete.cases(dt_houses)]
nas
That looks great, now we can explore our dataset :)
Before exploring the data, I want to load all the libraries we will probably use:
library(data.table)
library(ggcorrplot)
library(ggExtra)
library(ggplot2)
library(ggridges)
library(ggsci)
library(ggthemes)
library(RColorBrewer)
library(svglite)
library(viridis)
library(scales)
library(rpart)
library(rpart.plot)
I found some helpful functions in R for a first look at the data. We will start with the structure, then get some summary statistics, and finally take a head() of the data.
str(dt_houses)
Classes ‘data.table’ and 'data.frame': 545 obs. of 13 variables:
$ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
$ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
$ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
$ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
$ stories : int 3 4 2 2 2 1 4 2 2 4 ...
$ mainroad : chr "yes" "yes" "yes" "yes" ...
$ guestroom : chr "no" "no" "no" "no" ...
$ basement : chr "no" "no" "yes" "yes" ...
$ hotwaterheating : chr "no" "no" "no" "no" ...
$ airconditioning : chr "yes" "yes" "no" "yes" ...
$ parking : int 2 3 2 3 2 2 2 0 2 1 ...
$ prefarea : chr "yes" "no" "yes" "yes" ...
$ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
- attr(*, ".internal.selfref")=<externalptr>
Summary statistics:
summary(dt_houses[, .(price, area, bedrooms, bathrooms, stories, parking)])
price area bedrooms bathrooms stories parking
Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.0000
1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000
Median : 4340000 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :0.0000
Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286 Mean :1.806 Mean :0.6936
3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :3.0000
and this is a sample of our dataset:
head(dt_houses)
I would like to start with the density of the main variables which, from my domain knowledge, are important for the price of a property.
Price density:
ggplot(data = dt_houses, aes(x = price)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Price',
y = 'Density'
) +
geom_vline(xintercept = mean(dt_houses$price), linetype="dashed") +
scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
This density plot visualizes the distribution of house prices, showing that most houses are priced around 4-5 million, with a right-skewed distribution (some higher-priced houses pulling the mean up). The dashed vertical line marks the mean price (~4.77M). The plot highlights that while most houses fall within a moderate price range, some expensive properties extend beyond 10M.
Area density:
ggplot(data = dt_houses, aes(x = area)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Area',
y = 'Density'
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
The area density plot looks similar to the price density plot, which makes sense: the bigger the area, the higher the expected cost. The plot shows that most houses have an area of roughly 3000-5000, but some properties exceed 12000.
Next plot will visualize the distribution of price depending on area.
ggplot() +
geom_point(data = dt_houses, aes(x = area, y = price, color = parking)) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
This scatter plot visualizes the relationship between house area (x-axis) and price (y-axis), with color indicating the number of parking spaces. It shows a positive correlation between area and price—larger houses tend to be more expensive. However, there is some variability, as some large houses have relatively lower prices. The color gradient suggests that houses with more parking spaces (lighter blue) tend to be higher in price and larger in area.
The next plot is a boxplot, with bedrooms as a factor variable on the x-axis and price on the y-axis, to get an overall understanding of how the number of bedrooms affects the price.
ggplot(data = dt_houses, aes(x = factor(bedrooms), y = price)) +
geom_boxplot() +
theme_minimal()
The boxplot shows that, on average, houses with more bedrooms have higher prices, but around 4-6 bedrooms the first quartile stagnates, and so does the median price. There are some outliers, but not too many.
It is also interesting to look at the distribution of bedrooms, so the next plot is a histogram: I want to know which number of bedrooms is the most “popular” in the whole dataset.
ggplot(data = dt_houses, aes(x = bedrooms)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$bedrooms), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Mean number of bedrooms:
mean(dt_houses$bedrooms)
[1] 2.965138
From this visualization we can see that most of the houses have 2, 3 or 4 bedrooms; 1, 5 and 6 bedrooms are far less common in this dataset.
Let’s have a look at the histogram of stories:
ggplot(data = dt_houses, aes(x = stories)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$stories), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
mean(dt_houses$stories)
[1] 1.805505
This plot shows that the most common numbers of stories are 1 and 2; houses with 3 or 4 stories together make up fewer than 100.
Bathrooms are also an interesting variable, so let’s take a look at a histogram of bathrooms and a boxplot of bathrooms versus price:
ggplot(data = dt_houses, aes(x = bathrooms)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$bathrooms), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
ggplot(data = dt_houses, aes(x = factor(bathrooms), y = price)) +
geom_boxplot() +
theme_minimal()
Here it is also almost obvious that more bathrooms push the price up. The one drawback is that my dataset does not have enough data about properties with 3 or 4 bathrooms: there are some with 3, but very few with 4.
Furnishing is also important: many people search for furnished apartments, but the furniture may not be in the best shape, or the buyer may not like the style. So in my opinion it is not as strong a predictor as, for example, area.
How many properties are furnished, semi-furnished or unfurnished:
ggplot(data = dt_houses, aes(x = factor(furnishingstatus), fill = factor(furnishingstatus))) +
geom_bar(color="#ced4da", alpha=0.25) +
scale_fill_viridis_d(option = "D") +
labs(title = "Furnishing Status Counts",
x = "Furnishing Status",
y = "Count") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
We can see that most of the houses are semi-furnished, which is logical: when selling a house or apartment, the sellers usually take their most valuable belongings with them, furniture included.
Now it would be useful to look at the price and area distribution across differently furnished properties.
ggplot(data = dt_houses, aes(x = area, y = price)) +
geom_point(aes(color = bedrooms)) +
geom_hline(yintercept = mean(dt_houses$price), linetype='dashed') +
facet_grid(.~furnishingstatus) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
scale_color_distiller(type = "seq", palette = "Greens") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Also, on average, you can notice that unfurnished houses are less expensive.
We can also take a look at some pie charts:
dt_mainroad_counts <- as.data.frame(table(dt_houses$mainroad)) #table() - creates frequency table
colnames(dt_mainroad_counts) <- c("mainroad_status", "count")
dt_mainroad_counts$percentage <- round(dt_mainroad_counts$count / sum(dt_mainroad_counts$count) * 100, 1)
ggplot(data = dt_mainroad_counts, aes(x = "", y = count, fill = mainroad_status)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5), color = "white", size = 4) +
theme_void() +
scale_fill_manual(values = c("#F1B147", "#47B1F1")) +
labs(
title = "Distribution of Mainroad Status",
fill = "Mainroad Status"
)
Almost 86 percent of the houses are on a main road, so this may not be a strong predictor variable.
dt_airconditioning_counts <- as.data.frame(table(dt_houses$airconditioning)) #table() - creates frequency table
colnames(dt_airconditioning_counts) <- c("airconditioning_status", "count")
dt_airconditioning_counts$percentage <- round(dt_airconditioning_counts$count / sum(dt_airconditioning_counts$count) * 100, 1)
ggplot(data = dt_airconditioning_counts, aes(x = "", y = count, fill = airconditioning_status)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5), color = "white", size = 4) +
theme_void() +
scale_fill_manual(values = c("#F1B147", "#47B1F1")) +
labs(
title = "Distribution of Airconditioning status",
fill = "Airconditioning Status"
)
Here 68.4 percent of the houses have air conditioning, but I do not know how much it will affect predictions.
I think that is enough exploration; we can start with our first model.
First, I would like to start simple, with a linear model.
I will include all variables in the model, because they all seem potentially important.
But before we start, I want to introduce a data table which will be very useful at the end of this coursework.
dt_features_performance <- data.table("price_lm_rmse" = c(0, 0, 0, 0, 0), "price_tree_rmse" = c(0, 0, 0, 0, 0), "feature" = c(0, 1, 2, 3, 4))
I will use the lm() function in R to estimate the beta coefficients and create my model.
price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2619718 -657322 -68409 507176 5166695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42771.69 264313.31 0.162 0.871508
area 244.14 24.29 10.052 < 2e-16 ***
bedrooms 114787.56 72598.66 1.581 0.114445
hotwaterheatingyes 855447.15 223152.69 3.833 0.000141 ***
airconditioningyes 864958.31 108354.51 7.983 8.91e-15 ***
stories 450848.00 64168.93 7.026 6.55e-12 ***
mainroadyes 421272.59 142224.13 2.962 0.003193 **
parking 277107.10 58525.89 4.735 2.82e-06 ***
furnishingstatussemi-furnished -46344.62 116574.09 -0.398 0.691118
furnishingstatusunfurnished -411234.39 126210.56 -3.258 0.001192 **
bathrooms 987668.11 103361.98 9.555 < 2e-16 ***
guestroomyes 300525.86 131710.22 2.282 0.022901 *
basementyes 350106.90 110284.06 3.175 0.001587 **
prefareayes 651543.80 115682.34 5.632 2.89e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1068000 on 531 degrees of freedom
Multiple R-squared: 0.6818, Adjusted R-squared: 0.674
F-statistic: 87.52 on 13 and 531 DF, p-value: < 2.2e-16
We got an R-squared of 0.68, which is not bad for a first, untuned model. But that’s not all; I will try to do better here, but first, another model.
I would like to measure the performance of my models with RMSE, so I will calculate it for the linear model.
price_lm_rmse <- mean(sqrt(abs(price_lm$residuals)))
price_lm_rmse
[1] 797.382
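One caveat worth noting: the expression above averages the square roots of the absolute residuals, whereas the textbook RMSE squares the residuals, averages them, and then takes the square root. A small helper for the conventional definition (the function name rmse is my own), should a directly comparable metric be wanted:

```r
# Conventional RMSE: square the residuals, average, then take the square root.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3) ~ 1.155
```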
I think this model could perform better, because some variables may affect the price in a non-linear way; in such cases a tree model can show better performance.
In this coursework, rpart will be used to create a regression tree.
prices_tree <- rpart(data = dt_houses, formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, method = 'anova')
prp(prices_tree, digits = -3)
printcp(prices_tree)
Regression tree:
rpart(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea, data = dt_houses, method = "anova")
Variables actually used in tree construction:
[1] airconditioning area basement bathrooms furnishingstatus parking
Root node error: 1.9032e+15/545 = 3.4921e+12
n= 545
CP nsplit rel error xerror xstd
1 0.304946 0 1.00000 1.00456 0.085202
2 0.094553 1 0.69505 0.72070 0.063329
3 0.053743 2 0.60050 0.62700 0.054969
4 0.026381 3 0.54676 0.59647 0.051655
5 0.024922 4 0.52038 0.59399 0.052224
6 0.022993 5 0.49546 0.58608 0.052820
7 0.021374 6 0.47246 0.56090 0.051090
8 0.015261 7 0.45109 0.55862 0.050557
9 0.013952 8 0.43583 0.56672 0.051222
10 0.012386 9 0.42188 0.55446 0.050699
11 0.010000 10 0.40949 0.54623 0.050358
Now that I have built the tree model with rpart, let us explore it:
prices_tree
n= 545
node), split, n, deviance, yval
* denotes terminal node
1) root 545 1.903208e+15 4766729
2) area< 5954 361 6.066751e+14 4029993
4) bathrooms< 1.5 293 3.297298e+14 3773561
8) area< 4016 174 1.437122e+14 3431227
16) furnishingstatus=unfurnished 78 4.036605e+13 2977962 *
17) furnishingstatus=furnished,semi-furnished 96 7.430067e+13 3799505 *
9) area>=4016 119 1.358098e+14 4274118 *
5) bathrooms>=1.5 68 1.746610e+14 5134912
10) airconditioning=no 44 7.024826e+13 4563682 *
11) airconditioning=yes 24 6.373358e+13 6182167 *
3) area>=5954 184 7.161564e+14 6212174
6) bathrooms< 1.5 108 2.869179e+14 5382579
12) airconditioning=no 65 1.170629e+14 4843569
24) basement=no 38 5.226335e+13 4304816 *
25) basement=yes 27 3.824662e+13 5601815 *
13) airconditioning=yes 43 1.224240e+14 6197360 *
7) bathrooms>=1.5 76 2.492851e+14 7391072
14) parking< 1.5 51 7.184700e+13 6859794 *
15) parking>=1.5 25 1.336772e+14 8474878
30) airconditioning=no 10 5.146311e+13 7285600 *
31) airconditioning=yes 15 5.864106e+13 9267729 *
We can see that the tree has 10 splits and 11 terminal nodes (the node labels go up to 31, but not every number is used); I think that is reasonable for a dataset of this size.
Now it would be great to prune the tree, because I do not want my tree to overfit:
plotcp(prices_tree)
This shows the complexity parameter (cp) of the tree. We want the value with the lowest cross-validated error, which keeps only as many leaves as the data supports, so the tree won’t overfit.
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp )
prp(model_tree, digits = -3)
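Picking the cp with the minimum cross-validated error is one option; another common choice is the one-standard-error rule, which prefers the simplest tree whose error is within one standard error of the minimum. A minimal standalone sketch on a built-in dataset (mtcars, not dt_houses; the variable names are mine):

```r
library(rpart)

# Illustration on a built-in dataset, not on dt_houses.
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
cpt <- fit$cptable

# 1-SE rule: take the simplest tree whose cross-validated error is within
# one standard error of the minimum, instead of the minimum itself.
min_row   <- which.min(cpt[, "xerror"])
threshold <- cpt[min_row, "xerror"] + cpt[min_row, "xstd"]
one_se_cp <- cpt[which(cpt[, "xerror"] <= threshold)[1], "CP"]
fit_1se   <- prune(fit, cp = one_se_cp)
```
Because the cptable is sorted from simplest to most complex, the first row under the threshold is the simplest qualifying tree.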
After we have pruned the tree, let’s calculate the RMSE for the tree model:
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea")]) # predict with the pruned tree
prices_tree_rmse <- mean(sqrt(abs(dt_houses$price - prices_tree_pred)))
prices_tree_rmse
[1] 860.0223
The price linear model has an RMSE of
price_lm_rmse
[1] 797.382
The price tree model has an RMSE of
prices_tree_rmse
[1] 860.0223
It is surprising to me, as someone without much modelling experience, that the linear model outperforms the tree model by approx. 7.28%.
100 - price_lm_rmse / prices_tree_rmse * 100
[1] 7.283574
Collecting the data for my summary statistics at the end:
dt_features_performance$price_lm_rmse[dt_features_performance$feature == 0] <- price_lm_rmse
dt_features_performance$price_tree_rmse[dt_features_performance$feature == 0] <- prices_tree_rmse
Here I would like to try all the ideas and observations I have gathered throughout this coursework. The binary yes/no variables are already dummy-encoded by R’s lm(), so instead I will try treating numeric variables such as bedrooms, bathrooms and stories as factors.
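To see what lm() actually does with these yes/no character variables, model.matrix() exposes the design matrix it builds; a tiny standalone example with made-up data:

```r
# Tiny made-up data frame with one character predictor.
toy <- data.frame(
  price    = c(100, 200, 150, 250),
  mainroad = c("yes", "no", "yes", "no")
)

# lm() builds this design matrix internally: "yes"/"no" becomes a single
# 0/1 dummy column, with "no" as the reference level.
model.matrix(price ~ mainroad, data = toy)
```
This is why the regression summaries above show coefficients named like mainroadyes: the coefficient measures the price shift relative to the "no" baseline.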
I will start with bedrooms: instead of the raw bedrooms column, I will use factor(bedrooms) in the formula.
Let’s try the model with this new factor variable.
price_lm <- lm(formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + factor(bedrooms) + hotwaterheating +
airconditioning + stories + mainroad + parking + furnishingstatus +
bathrooms + guestroom + basement + prefarea, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2603715 -652407 -78857 515376 5210924
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 357446.85 775559.37 0.461 0.645068
area 244.65 24.48 9.992 < 2e-16 ***
factor(bedrooms)2 -114030.93 766523.74 -0.149 0.881797
factor(bedrooms)3 44379.14 767625.38 0.058 0.953919
factor(bedrooms)4 95121.74 776261.47 0.123 0.902519
factor(bedrooms)5 225794.30 840612.20 0.269 0.788337
factor(bedrooms)6 725852.04 1075980.80 0.675 0.500228
hotwaterheatingyes 855162.11 225019.54 3.800 0.000161 ***
airconditioningyes 865512.39 108940.97 7.945 1.19e-14 ***
stories 447940.57 66210.11 6.765 3.54e-11 ***
mainroadyes 420192.91 143761.60 2.923 0.003617 **
parking 277396.50 58975.46 4.704 3.27e-06 ***
furnishingstatussemi-furnished -42608.78 117741.15 -0.362 0.717583
furnishingstatusunfurnished -412029.93 126841.54 -3.248 0.001235 **
bathrooms 995395.42 104634.07 9.513 < 2e-16 ***
guestroomyes 302160.05 132199.82 2.286 0.022672 *
basementyes 348044.60 111709.52 3.116 0.001936 **
prefareayes 646125.01 117161.83 5.515 5.48e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1071000 on 527 degrees of freedom
Multiple R-squared: 0.6822, Adjusted R-squared: 0.6719
F-statistic: 66.54 on 17 and 527 DF, p-value: < 2.2e-16
price_lm_rmse <- mean(sqrt(abs(price_lm$residuals)))
price_lm_rmse
[1] 796.3947
Now the RMSE is a little better, by approximately 0.12%. Next I am going to test this variable on the tree model.
100 - price_lm_rmse / 797.382 * 100
[1] 0.1238174
prices_tree <- rpart(data = dt_houses, formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, method = 'anova')
prp(prices_tree, digits = -3)
For the feature engineering runs I won’t plot the tree complexity or explore the tree itself, because the main focus here is on benchmarking and comparing the two models with new features. Let’s prune the model and measure the RMSE.
# pruning
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp )
# predicting
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea")])
#calculating RMSE
prices_tree_rmse <- mean(sqrt(abs(dt_houses$price - prices_tree_pred)))
prices_tree_rmse
[1] 860.0223
It performs the same, and I think that is expected: the tree is largely insensitive to this change. It already splits bedrooms at numeric thresholds, and with so few distinct values the factor version leads to essentially the same splits; the linear model, by contrast, interprets the factor encoding differently.
Here I am recording the results for a small chart at the end.
dt_features_performance$price_lm_rmse[dt_features_performance$feature == 1] <- price_lm_rmse
dt_features_performance$price_tree_rmse[dt_features_performance$feature == 1] <- prices_tree_rmse
With the new factor(bedrooms) feature, the linear model performs slightly better (RMSE 796.3947), while the tree model has not improved.
The linear model is still ahead, but the tree may yet catch up: we have 3 more features to try.
What if we try to bring the area variable closer to Gaussian with a log transformation? The area density is skewed to the right, and a log transformation can help normalize it.
I am going to create a new column in my data table called area_log, containing log(area) for each row.
dt_houses[, area_log := log(area)]
Let’s also visualize it. The distribution now looks much more balanced, and I think it will work better.
ggplot(data = dt_houses, aes(x = area_log)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Area (log)',
y = 'Density'
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
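As a tiny numeric check of how the log compresses the long right tail, using the min, median and max from the summary() output above:

```r
# Quartile values copied from the summary() output above.
areas <- c(min = 1650, median = 4600, max = 16200)

areas["max"] / areas["median"]            # ~3.52: long right tail
log(areas)["max"] / log(areas)["median"]  # ~1.15: tail pulled in after log
```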
Now I want to run the model with a second new feature.
price_lm <- lm(formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea + area_log, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + factor(bedrooms) + hotwaterheating +
airconditioning + stories + mainroad + parking + furnishingstatus +
bathrooms + guestroom + basement + prefarea + area_log, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2635466 -652907 -65980 484713 5180753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.627e+06 3.583e+06 -2.408 0.016390 *
area 3.981e+01 8.340e+01 0.477 0.633303
factor(bedrooms)2 -6.539e+04 7.627e+05 -0.086 0.931708
factor(bedrooms)3 1.017e+05 7.639e+05 0.133 0.894098
factor(bedrooms)4 1.312e+05 7.723e+05 0.170 0.865155
factor(bedrooms)5 3.657e+05 8.380e+05 0.436 0.662678
factor(bedrooms)6 7.477e+05 1.070e+06 0.699 0.485144
hotwaterheatingyes 8.513e+05 2.238e+05 3.803 0.000160 ***
airconditioningyes 8.208e+05 1.098e+05 7.478 3.17e-13 ***
stories 4.448e+05 6.587e+04 6.753 3.84e-11 ***
mainroadyes 3.474e+05 1.458e+05 2.383 0.017527 *
parking 2.701e+05 5.873e+04 4.598 5.35e-06 ***
furnishingstatussemi-furnished -6.605e+04 1.175e+05 -0.562 0.574204
furnishingstatusunfurnished -4.314e+05 1.264e+05 -3.413 0.000691 ***
bathrooms 9.904e+05 1.041e+05 9.514 < 2e-16 ***
guestroomyes 2.435e+05 1.335e+05 1.824 0.068656 .
basementyes 3.628e+05 1.113e+05 3.261 0.001184 **
prefareayes 6.678e+05 1.169e+05 5.715 1.84e-08 ***
area_log 1.193e+06 4.644e+05 2.568 0.010508 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1066000 on 526 degrees of freedom
Multiple R-squared: 0.6861, Adjusted R-squared: 0.6754
F-statistic: 63.88 on 18 and 526 DF, p-value: < 2.2e-16
price_lm_rmse <- mean(sqrt(abs(price_lm$residuals)))
price_lm_rmse
[1] 792.3163
Success! It makes fewer errors: previously the RMSE was 796.3947, now it is 792.3163, a 0.51% performance improvement.
100 - price_lm_rmse / 796.3947 * 100
[1] 0.512108
Now it is the tree’s turn: I want to run the new feature through the tree model.
prices_tree <- rpart(data = dt_houses, formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea + area_log, method = 'anova')
prp(prices_tree, digits = -3)
Now I will prune the tree and then calculate the RMSE for the model with this new feature.
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp )
prp(model_tree, digits = -3)
and calculating the error
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea", "area_log")])
prices_tree_rmse <- mean(sqrt(abs(dt_houses$price - prices_tree_pred)))
prices_tree_rmse
[1] 860.0223
Yep, there is no gain in performance, and I can guess why. The linear model benefits from normalizing variables because it is sensitive to their scale and distribution, but the tree model does not “care” much about the density of a variable: it does not compute distances between points, only ordered threshold splits, which also makes it good at non-linear dependencies. Since log is a monotone transformation, it preserves the ordering of area, so the tree can reach exactly the same splits either way. This is my interpretation, but I could be wrong.
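A quick synthetic check of this intuition, on made-up data rather than dt_houses: a tree fit on a predictor and a tree fit on its log produce identical fitted values, because log preserves the ordering of all candidate splits.

```r
library(rpart)
set.seed(42)

# Made-up data: one positive predictor and a noisy response.
x <- runif(300, min = 1, max = 100)
d <- data.frame(x = x, x_log = log(x), y = 2 * x + rnorm(300, sd = 5))

fit_raw <- rpart(y ~ x,     data = d, method = "anova")
fit_log <- rpart(y ~ x_log, data = d, method = "anova")

# Both trees partition the data identically: only the split thresholds
# differ (c vs. log(c)), not which points land in which leaf.
all.equal(predict(fit_raw), predict(fit_log))  # TRUE
```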
So the RMSE from the linear model is 792.3163 and from the tree 860.0223; the linear model’s performance is still better than the tree’s.
dt_features_performance$price_lm_rmse[dt_features_performance$feature == 2] <- price_lm_rmse
dt_features_performance$price_tree_rmse[dt_features_performance$feature == 2] <- prices_tree_rmse
I think it could be a good idea to take a look at the correlation between the variables, although from the data exploration I can already say that area correlates with price.
Here we are, the correlation plot:
ggcorrplot(corr = cor(dt_houses[, .(price, area, bedrooms, bathrooms, stories, parking)]),
hc.order = TRUE,
lab = TRUE)
Hm, the correlation plot does not look as strong as I expected, but the variables correlating most strongly with price are area and the number of bathrooms.
I have an idea: bathrooms range from 1 to 4. What if we treat each number of bathrooms as a separate factor level? It is possible that a house with 2 bathrooms is drastically more expensive than one with 1, and one with 3 bathrooms is costlier still.
price_lm <- lm(formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + factor(bathrooms) + guestroom + basement + prefarea + area_log, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + factor(bedrooms) + hotwaterheating +
airconditioning + stories + mainroad + parking + furnishingstatus +
factor(bathrooms) + guestroom + basement + prefarea + area_log,
data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2611477 -654190 -71264 502239 5253586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.206e+06 3.598e+06 -2.281 0.022960 *
area 2.480e+01 8.385e+01 0.296 0.767509
factor(bedrooms)2 -5.979e+04 7.613e+05 -0.079 0.937432
factor(bedrooms)3 1.192e+05 7.625e+05 0.156 0.875794
factor(bedrooms)4 1.483e+05 7.709e+05 0.192 0.847515
factor(bedrooms)5 3.700e+05 8.374e+05 0.442 0.658769
factor(bedrooms)6 7.965e+05 1.069e+06 0.745 0.456418
hotwaterheatingyes 8.678e+05 2.236e+05 3.881 0.000117 ***
airconditioningyes 8.282e+05 1.100e+05 7.532 2.21e-13 ***
stories 4.440e+05 6.599e+04 6.728 4.51e-11 ***
mainroadyes 3.479e+05 1.455e+05 2.391 0.017164 *
parking 2.613e+05 5.880e+04 4.445 1.08e-05 ***
furnishingstatussemi-furnished -6.554e+04 1.179e+05 -0.556 0.578442
furnishingstatusunfurnished -4.361e+05 1.267e+05 -3.442 0.000624 ***
factor(bathrooms)2 9.004e+05 1.208e+05 7.453 3.81e-13 ***
factor(bathrooms)3 2.166e+06 3.554e+05 6.095 2.13e-09 ***
factor(bathrooms)4 4.801e+06 1.090e+06 4.404 1.29e-05 ***
guestroomyes 2.479e+05 1.333e+05 1.860 0.063503 .
basementyes 3.694e+05 1.111e+05 3.324 0.000949 ***
prefareayes 6.795e+05 1.168e+05 5.819 1.03e-08 ***
area_log 1.269e+06 4.665e+05 2.721 0.006732 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1064000 on 524 degrees of freedom
Multiple R-squared: 0.6885, Adjusted R-squared: 0.6766
F-statistic: 57.9 on 20 and 524 DF, p-value: < 2.2e-16
price_lm_rmse <- mean(sqrt(abs(price_lm$residuals)))
price_lm_rmse
[1] 787.8693
And we gain a bit more performance: approx. 1.19% better than the first model without engineered features. The factor levels are not equally reliable, though: there is a big jump from 1 to 2 bathrooms, but the estimates for 3 and 4 bathrooms may suffer from lack of data, since there are fewer than 5 houses with 3 or 4 bathrooms overall in my dataset.
100 - price_lm_rmse / 797.382 * 100
[1] 1.192997
Let us try the tree model now.
prices_tree <- rpart(data = dt_houses, formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + factor(bathrooms) + guestroom + basement + prefarea + area_log, method = 'anova')
prp(prices_tree, digits = -3)
Now we have to prune the tree and then make predictions with RMSE calculations.
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp )
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea", "area_log")])
prices_tree_rmse <- mean(sqrt(abs(dt_houses$price - prices_tree_pred)))
prices_tree_rmse
[1] 842.1272
This is awesome: the tree’s RMSE dropped by ~17.8951, which is 2.08% fewer errors.
100 - prices_tree_rmse / 860.0223 * 100
[1] 2.080773
This is getting interesting. While the linear model improved by 1.19%, the tree model made a bigger gain: ~2.08%, roughly twice the linear model’s improvement.
dt_features_performance$price_lm_rmse[dt_features_performance$feature == 3] <- price_lm_rmse
dt_features_performance$price_tree_rmse[dt_features_performance$feature == 3] <- prices_tree_rmse
I also want to treat stories as a categorical variable, again using R’s factor() function, and test this feature in both the linear and the tree model.
First, as usual, I will start with the linear model.
# calculating and running model
price_lm <- lm(formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + factor(stories) + mainroad + parking + furnishingstatus + factor(bathrooms) + guestroom + basement + prefarea + area_log, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + factor(bedrooms) + hotwaterheating +
airconditioning + factor(stories) + mainroad + parking +
furnishingstatus + factor(bathrooms) + guestroom + basement +
prefarea + area_log, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2652504 -649820 -77111 493086 5266659
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.345e+06 3.627e+06 -2.025 0.043379 *
area 2.909e+01 8.405e+01 0.346 0.729365
factor(bedrooms)2 -4.467e+04 7.622e+05 -0.059 0.953291
factor(bedrooms)3 1.732e+05 7.653e+05 0.226 0.821019
factor(bedrooms)4 2.245e+05 7.756e+05 0.289 0.772357
factor(bedrooms)5 4.580e+05 8.431e+05 0.543 0.587197
factor(bedrooms)6 8.987e+05 1.075e+06 0.836 0.403518
hotwaterheatingyes 8.801e+05 2.252e+05 3.908 0.000105 ***
airconditioningyes 8.176e+05 1.106e+05 7.392 5.80e-13 ***
factor(stories)2 3.474e+05 1.205e+05 2.884 0.004091 **
factor(stories)3 8.448e+05 2.079e+05 4.064 5.56e-05 ***
factor(stories)4 1.380e+06 2.156e+05 6.400 3.46e-10 ***
mainroadyes 3.557e+05 1.459e+05 2.438 0.015118 *
parking 2.577e+05 5.927e+04 4.348 1.65e-05 ***
furnishingstatussemi-furnished -6.013e+04 1.182e+05 -0.509 0.611051
furnishingstatusunfurnished -4.422e+05 1.270e+05 -3.482 0.000540 ***
factor(bathrooms)2 8.894e+05 1.216e+05 7.317 9.62e-13 ***
factor(bathrooms)3 2.175e+06 3.563e+05 6.106 2.00e-09 ***
factor(bathrooms)4 4.715e+06 1.096e+06 4.303 2.01e-05 ***
guestroomyes 2.451e+05 1.336e+05 1.834 0.067153 .
basementyes 3.782e+05 1.119e+05 3.379 0.000782 ***
prefareayes 6.763e+05 1.186e+05 5.704 1.96e-08 ***
area_log 1.216e+06 4.700e+05 2.587 0.009940 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1065000 on 522 degrees of freedom
Multiple R-squared: 0.6891, Adjusted R-squared: 0.6759
F-statistic: 52.58 on 22 and 522 DF, p-value: < 2.2e-16
price_lm_rmse <- mean(sqrt(abs(price_lm$residuals)))
price_lm_rmse
[1] 784.9238
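A side note on the metric: the value above is the mean of the square roots of the absolute residuals, which is not the conventional RMSE (square root of the mean squared residual). Since the same formula is applied to every model in this report, the comparisons between models remain consistent. A base-R sketch with hypothetical residuals illustrating the difference:

```r
e <- c(-2.0, 1.5, 0.5, -1.0)        # hypothetical residual vector
metric_used <- mean(sqrt(abs(e)))   # the formula used throughout this report
rmse_conv <- sqrt(mean(e^2))        # conventional RMSE, for comparison
c(metric_used = metric_used, rmse_conv = rmse_conv)
```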
There is a small performance gain in the linear model, about 0.37%. Now let’s test this feature on the tree.
100 - price_lm_rmse / 787.8693 * 100
[1] 0.3738567
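The relative-gain calculation above can be wrapped in a small base-R helper for reuse (pct_gain is a hypothetical name, not part of the report's code):

```r
# Hypothetical helper: percentage improvement of `new` over `baseline`
pct_gain <- function(baseline, new) 100 - new / baseline * 100

pct_gain(787.8693, 784.9238)  # feature 3 -> feature 4, linear model: ~0.37
```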
After my previous experience with bathrooms, I expect some performance gain here as well, probably larger than in the linear model. I will now fit the tree model, then prune, predict, and measure the RMSE:
prices_tree <- rpart(data = dt_houses, formula = price ~ area + factor(bedrooms) + hotwaterheating + airconditioning + factor(stories) + mainroad + parking + furnishingstatus + factor(bathrooms) + guestroom + basement + prefarea + area_log, method = 'anova')
prp(prices_tree, digits = -3)
Now I prune and predict, then calculate the RMSE and compare it with the previous result.
# pruning
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp )
# predicting
prices_tree_pred <- predict(model_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea", "area_log")])
# calculating RMSE
prices_tree_rmse <- mean(sqrt(abs(dt_houses$price - prices_tree_pred)))
prices_tree_rmse
[1] 842.1272
It seems there is no performance gain from using stories as a factor in the tree model.
With feature number 4 we saw a gain in the linear model, but the tree model did not improve at all. Stories turned out not to be a very strong variable in either model: compared with a variable like bathrooms, it delivered roughly a third of the gain in the linear model and none in the tree model.
After including feature 4, the linear model still performs better than the tree model, with an overall RMSE of 784.92 versus 842.13.
dt_features_performance$price_lm_rmse[dt_features_performance$feature == 4] <- price_lm_rmse
dt_features_performance$price_tree_rmse[dt_features_performance$feature == 4] <- prices_tree_rmse
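One way to verify that stories carries little weight in the tree is rpart's variable.importance, which sums each predictor's contribution across the splits and their surrogates. A minimal sketch on the built-in mtcars data (not the housing set), using the same anova method:

```r
library(rpart)

# Small illustrative regression tree on built-in data
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Share of total importance per predictor, in percent; a predictor with a
# small share (like stories in the housing tree) barely influences the splits
imp <- fit$variable.importance
round(100 * imp / sum(imp), 1)
```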
Now that I have all my data, here is my concluding plot:
ggplot() +
geom_point(data = dt_features_performance, aes(x = feature, y = price_lm_rmse),
size = 4, color = "#1f77b4", alpha = 0.8) +
geom_line(data = dt_features_performance, aes(x = feature, y = price_lm_rmse),
color = "#1f77b4", linewidth = 1) +
geom_point(data = dt_features_performance, aes(x = feature, y = price_tree_rmse),
size = 4, color = "#ff7f0e", alpha = 0.8) +
geom_line(data = dt_features_performance, aes(x = feature, y = price_tree_rmse),
color = "#ff7f0e", linewidth = 1) +
labs(title = "Performance with Amount of Features",
x = "Amount of Features",
y = "Performance (RMSE)") +
theme_minimal() +
theme(
axis.line = element_line(color = "#000000"),
text = element_text(size = 14),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5)
)
With this plot I want to show my progress in creating features. The x-axis shows the number of features, from 0 to 4, since I have included the raw model as feature 0. The y-axis shows the RMSE of the models.
The orange line is the tree model and the blue line is the linear model.
Overall, both models improved as features were added, but for this dataset and my implementation the linear model performs better. When I started, I expected the tree model to perform much better.
Finally, I would like to point out that although the tree model performs worse overall, it got a substantial boost from feature number 3, whereas the linear model improved incrementally with each of the four features.
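As a closing note on the plot, the two hard-coded colors can instead be mapped through an aesthetic: reshaping the table to long format with data.table's melt lets ggplot draw the legend automatically. A sketch with a stand-in table holding only the values printed in this section (the full table covers features 0 to 4):

```r
library(data.table)
library(ggplot2)

# Stand-in for dt_features_performance with the values shown in this section
dt_perf <- data.table(
  feature = 3:4,
  price_lm_rmse = c(787.8693, 784.9238),
  price_tree_rmse = c(NA, 842.1272)
)

# Long format: one row per (feature, model) observation
dt_long <- melt(dt_perf, id.vars = "feature",
                measure.vars = c("price_lm_rmse", "price_tree_rmse"),
                variable.name = "model", value.name = "rmse")

p <- ggplot(dt_long, aes(x = feature, y = rmse, color = model)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c(price_lm_rmse = "#1f77b4",
                                price_tree_rmse = "#ff7f0e"),
                     labels = c("Linear model", "Tree model")) +
  labs(title = "Performance with Amount of Features",
       x = "Amount of Features", y = "Performance (RMSE)") +
  theme_minimal()
```

Mapping the model to the color aesthetic keeps a single pair of geoms and makes the legend maintain itself if more models are added later.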